Pricing

from $5.00 / 1,000 text chunks

Text Splitter & Chunker for RAG / LLMs

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

Developer

Rosario Vitale

Last modified

4 days ago

Why

Every RAG / LLM pipeline needs chunking, and everyone re-implements the same fiddly logic: respect paragraph and sentence boundaries, keep an overlap so context isn't lost, normalize messy whitespace, and estimate tokens. This Actor does it for you, reliably, in one call.

Features

✂️ Smart chunking — packs text up to your target size while respecting paragraph/sentence boundaries.
🔁 Overlap — keeps a configurable overlap so ideas spanning a boundary aren't lost.
🔢 Characters or tokens — size and overlap in characters or approximate tokens (~4 chars/token).
🧹 Cleaning — normalizes whitespace and collapses excessive blank lines.
📦 Batch — split many documents in a single run.
📊 Token estimate — every chunk includes charCount and approxTokens.
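The cleaning, packing, and overlap behaviour listed above can be illustrated with a minimal Python sketch. This is not the Actor's source code: the function names `clean_text` and `chunk_paragraphs` are hypothetical, and details such as how an oversized paragraph is handled are assumptions made for illustration.

```python
import re


def clean_text(text: str) -> str:
    """Normalize whitespace and collapse excessive blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)    # drop spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)  # 3+ newlines -> one blank line
    return text.strip()


def chunk_paragraphs(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Greedily pack whole paragraphs into chunks near a target size.

    Each new chunk begins with the last `overlap` characters of the previous
    chunk. Assumption: a paragraph longer than `chunk_size` is kept whole, and
    the overlap prefix may push a chunk slightly past the target.
    """
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # The next paragraph does not fit: close the chunk and carry a tail over.
            chunks.append(current)
            tail = current[-overlap:] if overlap > 0 else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = clean_text("First  paragraph.\n\n\n\nSecond paragraph.\n\nThird paragraph.")
    for chunk in chunk_paragraphs(doc, chunk_size=40, overlap=10):
        print(repr(chunk))
```

Packing whole paragraphs first and only then applying the overlap keeps natural boundaries intact, at the cost of chunk sizes that vary around the target.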

Input

| Field | Type | Description |
| --- | --- | --- |
| `text` | string | A single document to split. |
| `texts` | array | Multiple documents (one per item). |
| `chunkSize` | integer | Target chunk size, measured in `unit`. Default `1000`. |
| `chunkOverlap` | integer | Overlap between consecutive chunks, measured in `unit`. Default `100`. |
| `unit` | select | `characters` or `tokens`. Default `characters`. |
| `splitBy` | select | `paragraph`, `sentence` or `character`. Default `paragraph`. |
| `clean` | boolean | Normalize whitespace. Default `true`. |

Example input

{
    "text": "Your long document text goes here...",
    "chunkSize": 1000,
    "chunkOverlap": 100,
    "unit": "characters",
    "splitBy": "paragraph",
    "clean": true
}
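If you assemble the input programmatically, a small helper can catch typos before a run is started. The sketch below is an illustration only: `build_run_input` is a hypothetical helper, the field names and defaults come from the Input table, and the rule that the overlap must be smaller than the chunk size is an assumption rather than documented Actor behaviour.

```python
import json

ALLOWED_UNITS = {"characters", "tokens"}
ALLOWED_SPLITS = {"paragraph", "sentence", "character"}


def build_run_input(text=None, texts=None, chunk_size=1000, chunk_overlap=100,
                    unit="characters", split_by="paragraph", clean=True):
    """Return a run-input dict that uses the field names from the Input table."""
    if text is None and not texts:
        raise ValueError("provide `text` or a non-empty `texts` list")
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unit must be one of {sorted(ALLOWED_UNITS)}")
    if split_by not in ALLOWED_SPLITS:
        raise ValueError(f"splitBy must be one of {sorted(ALLOWED_SPLITS)}")
    # Assumption: an overlap that is not smaller than the chunk size is rejected.
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunkSize > 0 and 0 <= chunkOverlap < chunkSize")

    run_input = {
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
        "unit": unit,
        "splitBy": split_by,
        "clean": clean,
    }
    if text is not None:
        run_input["text"] = text
    if texts:
        run_input["texts"] = list(texts)
    return run_input


if __name__ == "__main__":
    payload = build_run_input(text="Your long document text goes here...")
    print(json.dumps(payload, indent=4))
```

The resulting dict can be passed as the run input through whichever Apify client or HTTP call you already use.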

Output

One dataset item per chunk:

{
    "sourceIndex": 0,
    "chunkIndex": 0,
    "totalChunks": 3,
    "text": "Retrieval-Augmented Generation (RAG) combines a language model ...",
    "charCount": 312,
    "approxTokens": 78
}

Export as JSON, CSV, or Excel, or pull via the Apify API — then send the chunks straight to your embeddings model or vector DB.
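Before sending chunks to an embeddings model or vector DB, it can help to order them and give each one a stable ID. The following sketch assumes the dataset items have already been downloaded as a list of dicts shaped like the output example; `to_records` and the `doc{n}-chunk{m}` ID scheme are hypothetical choices, not something the Actor produces.

```python
def to_records(items: list[dict]) -> list[dict]:
    """Sort chunks by document and position, then shape them for an upsert."""
    ordered = sorted(items, key=lambda it: (it["sourceIndex"], it["chunkIndex"]))
    return [
        {
            # Hypothetical ID scheme: stable as long as the input order is stable.
            "id": f"doc{it['sourceIndex']}-chunk{it['chunkIndex']}",
            "text": it["text"],
            "metadata": {
                "sourceIndex": it["sourceIndex"],
                "chunkIndex": it["chunkIndex"],
                "approxTokens": it["approxTokens"],
            },
        }
        for it in ordered
    ]


if __name__ == "__main__":
    sample = [
        {"sourceIndex": 0, "chunkIndex": 1, "totalChunks": 2,
         "text": "second chunk", "charCount": 12, "approxTokens": 3},
        {"sourceIndex": 0, "chunkIndex": 0, "totalChunks": 2,
         "text": "first chunk", "charCount": 11, "approxTokens": 3},
    ]
    for record in to_records(sample):
        print(record["id"], record["text"])
```

Keeping `sourceIndex` and `chunkIndex` in the metadata lets you fetch neighbouring chunks at query time if a retrieved chunk needs more context.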

Common use cases

Prepare documents for embeddings + vector search (Pinecone, Qdrant, Weaviate, pgvector).
Build RAG context for ChatGPT/Claude apps.
Fit long content into LLM context windows.
Pairs perfectly with PDF to Structured Data — extract text from PDFs, then chunk it here.

Notes

Token counts are an estimate (~4 characters per token); exact tokenization depends on the model.
In `character` split mode the text is hard-cut at the size boundary; the `paragraph` and `sentence` modes respect natural boundaries.
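Both notes can be made concrete with a short sketch. It assumes the estimate is simply the character count divided by 4 (rounding up is an assumption; the output example, 312 characters and 78 tokens, is consistent with it) and that a size given in tokens is converted to characters with the same ratio. The names `approx_tokens` and `hard_cut` are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # rough ratio from the notes above; real tokenizers differ


def approx_tokens(text: str) -> int:
    """Estimate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def hard_cut(text: str, size: int, overlap: int = 0, unit: str = "characters") -> list[str]:
    """Cut text at fixed positions, ignoring sentence and paragraph boundaries."""
    if unit == "tokens":
        # Assumption: token sizes are translated to characters before cutting.
        size *= CHARS_PER_TOKEN
        overlap *= CHARS_PER_TOKEN
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    chunks = []
    start = 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    print(approx_tokens("x" * 312))
    print(hard_cut("0123456789", size=4, overlap=1))
```

For budgets that must be exact, count tokens with the tokenizer of the model you actually use instead of relying on the 4-characters-per-token ratio.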
